1 Overview

When  a  person  gives  a  task  to  another  person  there  are  at  least  two  sorts  of 
very  human  processes  at  work.  At  the  surface  level,  each  person  both  displays 
and perceives cross-cultural cues which regulate the interaction. Through
facial expressions, body posture, and utterances, the student unconsciously
speeds  or  slows  the  rate  at  which  the  instructor  is  teaching  and  directs  the 
instructor  to  provide  more  information  when  necessary.  At  a  deeper  level, 
the  transfer  of  information  is  successful  because  both  student  and  instructor 
share  a  common  sense  of  how  the  world  works.  Both  student  and  teacher 
share  not  only  knowledge  about  how  objects  behave  (an  intuitive  physics) 
but  also  knowledge  about  how  other  people  behave  (an  intuitive  psychology). 
Our challenge is to make commanding robots as intuitive and natural as
commanding professional soldiers by providing a natural and intuitive interface
that capitalizes on a person’s intuitive understanding of how to communicate,
and by instilling into robots that same deep understanding of the world
which  is  shared  by  people. 

2 Approach

We proposed developing the perceptual and intellectual abilities of robots so
that in the field, war-fighters can interact with them in the same natural ways
as they do with their human cohorts. To illustrate our goals and illuminate
the technical problems that we must solve to achieve these goals, we
outlined three scenarios:

1.  Showing  a  robot  how  to  open  the  gas  tank  of  an  unfamiliar  captured 
enemy  vehicle. 

2.  Instructing  a  robot  to  carry  out  a  reconnaissance  mission  ranging  a  few 
hundred  meters  from  the  command  post  within  a  strife-torn  downtown 
urban  environment. 

3.  Instructing  a  dextrous  forklift-like  robot  to  load  a  truck  by  showing  it 
how  the  particular  bulk  food  sacks  should  be  stacked  together,  one  by 
one. 

Our approach was based on two key ideas: imitation and social interaction.

Imitation  involves  the  robot  watching  and  listening  to  a  person  perform 
some  task  and  then  equivalently  executing  it.  From  its  observations  the  robot 
must extract which aspects of the person’s motions and utterances are
essential to actually carrying out the task, which are part of the instruction but
not  part  of  the  actual  task,  and  which  are  simply  connective  or  coincidental. 

Social interaction involves the robot engaging a person in the same
dynamic two-way communication processes which two people could share. Each
participant  gives  the  other  subconscious  cues  that  carry  messages  such  as  “I 
understand  that”,  “you’re  going  too  fast”,  “I  don’t  know  what  you  mean”, 
“I  already  know  that”,  “look  at  what’s  important”,  “no,  it’s  more  like  this”, 
“you  don’t  understand  it”,  and  “now  you  get  it!”.  The  mechanisms  for  these 
signals  are  complex  and  often  interrelated  and  involve  such  indications  as  gaze 
direction,  eye  contact,  averting  eye  contact,  nodding,  body  posture  shifts, 
facial  expressions,  head  motions,  pre-linguistic  verbalizations  (“hmmm”  or 
“uh-huh”),  and  codified  verbalizations  (“Sir!”). 

The principles of development, embodiment and integration contributed
to our approach. The process of development wherein humans perform
incrementally more difficult tasks in complex environments as they mature
inspires a developmental methodology for our robots. Embodiment
emphasizes human-like aspects of our robots’ bodies. The integration of multiple
sensory modalities, physical degrees of freedom and behavioral systems allows
a single robot to imitate and interact with humans in a more sophisticated
manner.


3  Research  Questions 

In trying to use imitation and social interaction techniques for human-robot
communication and for tasking robots in the field, there arise at least six
deep and difficult questions, each of which has many technical components
which form the topics of the proposed work:

1.  Knowing  what  aspects  of  behavior  to  imitate. 

2.  Mapping  from  one  body  to  another. 

3.  Implementing  corrective  actions  and  recognizing  success. 

4.  Chaining  pieces  of  action  together  into  larger  tasks. 

5.  Generalizing  imitated  actions  to  different  and  more  complex  tasks. 

6.  Making  the  interactions  intuitive  for  the  human. 

The chart below displays examples of using the principles of social
interaction, development, embodiment and system integration to address the six
major  research  questions  we  have  identified. 

Research questions (chart columns):

(1) Knowing what to imitate
(2) Mapping between bodies
(3) Correcting failures and recognizing success
(4) Chaining actions
(5) Generalizing to more complex tasks
(6) Intuitive interactions

Principles (chart rows), each with its example entries in left-to-right order:

(I) Social Interaction

• Uses attentional cues to recognize task relevant events and objects
• Identifies and displays emotional states for recognizing success
• Allows humans and robots to share similar social cues without effort

(II) Development

• Limits search space by incremental refinement of perception
• Simplifies mapping by incremental refinement of motor skills
• Provides natural decomposition of complex tasks
• Exploits incremental learning to build new, more complex skills
• Provides an established framework for building social skills

(III) Embodiment

• Assists directed perception by constraining possible movements
• Provides similar structure which simplifies mapping between bodies
• Enables simple but robust low-level behaviors
• Places physical limits on sequential actions
• Allows the human to observe the robot's natural social cues

(IV) Integration

• Allows robot to recognize cues in multiple modalities (voice, gesture)
• Increases robustness through multi-channel redundancy
• Assists in transfer to new modalities
• Allows human to use natural methods of communication (voice, gesture)

4  Student  Output 

Students supported by this effort for their PhD degrees have gone on to a
number of positions:

•  Artur  M.  Arsenio,  researcher,  Siemens. 

• Cynthia Breazeal, Assistant Professor of Media Arts and Sciences, MIT.

• Paul Fitzpatrick, Lecturer, Electrical Engineering and Computer Science, MIT.

•  Charles  C.  Kemp,  post-doctoral  associate,  CSAIL,  MIT. 

•  Matthew  Marjanovic,  researcher,  ITA  Software. 

• Brian Scassellati, Assistant Professor of Computer Science, Yale.

• Matthew Williamson, researcher, Sana Security.

Postdoctoral  students  supported  by  this  effort  have  moved  on  to  new 
positions: 

• Martin C. Martin, researcher, Icosystems.

•  Giorgio  Metta,  Lecturer,  University  of  Genoa. 

A number of students who started work under this effort continue their
PhD studies at MIT:

•  Bryan  Adams 

•  Lijin  Aryananda 

•  Jessica  Banks 

•  Aaron  Edsinger-Gonzales 

• Eduardo Torres-Jara

•  Paulina  Varchavskaia 


5 Results for 1999-2000

Flexible Turn-Taking based on eye contact and head motion. We have
demonstrated robust and flexible vocal turn-taking on our robot, Kismet.
Kismet can engage in a proto-dialog with a single person as well as with two
people. Kismet determines when it should take its turn based on pauses in
speech and the current phase of the turn-taking interaction. Through
experiments with naive subjects, we have found that people intuitively read the
robot’s physical and vocal cues (change of gaze direction, shifts of posture,
and pauses in vocalizations) and naturally use these cues to time their own
response. As a result, the proto-dialog becomes smoother over time, with
fewer accidental interruptions or pauses.
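
The turn-taking rule described above (take the turn after a pause in speech,
depending on the current phase of the interaction) can be illustrated with a
small state machine. This is only a sketch under stated assumptions: the class
name TurnTaker, the two phase names and the 0.5 second pause threshold are
hypothetical and are not taken from Kismet's implementation.

```python
from dataclasses import dataclass


@dataclass
class TurnTaker:
    """Toy pause-based turn-taking; names and threshold are illustrative."""
    pause_threshold: float = 0.5   # seconds of silence counted as a pause (assumed)
    phase: str = "listening"       # either "listening" or "speaking"
    silence: float = 0.0           # silence accumulated while listening

    def step(self, human_speaking: bool, dt: float) -> str:
        """Advance by dt seconds and return the phase after the update."""
        if self.phase == "listening":
            if human_speaking:
                self.silence = 0.0
            else:
                self.silence += dt
                if self.silence >= self.pause_threshold:
                    self.phase = "speaking"   # the pause is long enough: take the turn
                    self.silence = 0.0
        elif human_speaking:
            # The human started talking during the robot's turn: yield the floor.
            self.phase = "listening"
            self.silence = 0.0
        return self.phase
```

A caller would invoke step once per audio frame, passing whether speech is
currently detected and the frame duration.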

Detect Prosody in human speech and show appropriate facial
responses. We have demonstrated a robust technique for recognizing
affective intent in robot-directed speech. By analyzing the prosody of a person’s
speech, Kismet can determine whether it was praised, prohibited, soothed,
or given an attentional bid. The robot can distinguish these affective intents
from neutral robot-directed speech. The output of the recognizer modulates
the robot’s emotional models, inducing an appropriate affective state with
a corresponding facial expression (an expression of happiness when praised,
sorrow when prohibited, interest when alerted, and a relaxed expression for
soothing). In multi-lingual experiments with naive female subjects, we found
that the robot was able to robustly classify the four affective intents. In
addition, the subjects intuitively inferred when their intent had been properly
understood by Kismet’s expressive feedback.

Expressive feedback through face, voice, and body posture. We have
implemented expressive feedback in multiple modalities on Kismet. The
robot is able to express itself through voice, facial expression, and body
posture. We have evaluated the readability of Kismet’s expressions for anger,
disgust, fear, happiness, interest, sorrow, surprise, and some interesting blends
through numerous studies with naive human subjects.

Visual attention and gaze direction. We have implemented a visual
attention system on Cog and Kismet based on Jeremy Wolfe’s model of
human visual search. We have tested the robustness of the attention system on
these robots. By matching the robot’s visual system to what humans find to
be inherently salient, the robot’s attention is often drawn to the same sorts of
stimuli that draw human attention. In studies with naive subjects, we found
that people intuitively use natural attention-grabbing cues (motion, proximity,
etc.) to quickly direct the robot’s attention. The subjects intuitively use the
robot’s gaze and smooth pursuit behavior to determine when they have
successfully directed the robot’s attention.
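
Models of visual search in this family combine several feature maps into one
activation map and attend to its peak. The sketch below shows that idea as a
weighted sum. It is a minimal illustration, not the attention system used on
Cog or Kismet: the function names, the feature names ("motion", "color") and
the weights are assumptions made for the example.

```python
def saliency_map(feature_maps, weights):
    """Weighted sum of equally sized 2-D feature maps (lists of lists).

    feature_maps: dict mapping a feature name to its 2-D map.
    weights: dict mapping the same names to gains (hypothetical values).
    """
    names = list(feature_maps)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    return [[sum(weights[n] * feature_maps[n][r][c] for n in names)
             for c in range(cols)]
            for r in range(rows)]


def most_salient(saliency):
    """Return the (row, col) location of the peak of the saliency map."""
    best = max((value, r, c)
               for r, row in enumerate(saliency)
               for c, value in enumerate(row))
    return best[1], best[2]
```

Raising the weight of one feature shifts the peak toward locations where that
feature is strong, which is one way a person waving an object could capture
the robot's gaze.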

Papers on this work included: [1], [47], [4], [35], [30], [37], [38], [85], [86],
[48], [84], [39], [40], [98].

Presentations on this work included: [43], [31], [42].


6  Results  for  2000-2001 

Detecting Head Orientation. We have implemented and evaluated a
system that detects the orientation of a person’s head from as far as six
meters away from the robot. To accomplish this, we have implemented a
multi-stage behavior. Whenever the robot sees an item of interest, it moves
its eyes and head to bring that object within the field of view of the foveal
cameras. A face finding algorithm based on skin color and shape is used to
identify faces, and a software zoom is used to capture as much information
as possible. The system then identifies a set of facial features (eyes and
nose/mouth) and uses a model of human facial structure to identify the
orientation of the person’s head.

Mimicry. Cog’s torso was retrofitted with force sensing capabilities in
order to implement body motion via virtual spring force control. In addition,
we developed a representational language for humanoid motor control
inspired by the neurophysiological organizing principle of motor primitives.
Both endeavors allowed Cog to broadly mimic the motions of a person with
whom it interacts using its body or arms. In the arm imitation behavior, the
robot continuously tracks many object trajectories. A trajectory is selected
on the basis of animacy and the attentional state of the instructor. Motion
trajectories are then converted from a visual representation to a motor
representation which the robot can execute. The performance of this mimicry
response was evaluated with naive human instructors.
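
Virtual spring control treats each joint as if a spring and damper pulled it
toward a rest angle, so that commanded motion stays compliant. The following
is a generic sketch of that control law and not Cog's controller: the
stiffness and damping values, the unit joint inertia and the function names
are assumptions chosen for illustration.

```python
def virtual_spring_torque(theta, theta_rest, omega, stiffness=10.0, damping=1.0):
    """Torque of a virtual spring-damper pulling a joint toward theta_rest.

    theta, theta_rest: joint angle and rest angle (rad).
    omega: joint velocity (rad/s). Gains are illustrative values.
    """
    return stiffness * (theta_rest - theta) - damping * omega


def settle(theta=0.0, theta_rest=1.0, dt=0.001, steps=20000):
    """Simulate a joint with unit inertia (assumed) under the virtual spring."""
    omega = 0.0
    for _ in range(steps):
        tau = virtual_spring_torque(theta, theta_rest, omega)
        omega += tau * dt     # unit inertia: angular acceleration equals torque
        theta += omega * dt
    return theta
```

Moving the rest angle over time drives the joint along a trajectory while a
person pushing on the limb only feels the spring.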

Distinguishing Animate from Inanimate. We have implemented a
system that distinguishes the movement patterns of animate objects
from those of inanimate objects. This system uses a multi-agent architecture
to represent a set of naive rules of physics that are drawn from experimental
results on human subjects. These naive rules represent the effects of gravity,
inertia, and other intuitive parts of Newtonian mechanics. We have evaluated
this system by comparing the results to human performance on classifying
the movement of point-light sources, and found the system to be more than
85% accurate on a test suite of recorded real-world data.
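
One naive rule of this kind can be sketched as follows: an object moving only
under inertia and gravity has constant acceleration, so a trajectory whose
acceleration changes is a candidate for animate motion. This single-rule
example is an assumption-laden illustration and not the multi-agent system
described above; the function names and the tolerance are hypothetical.

```python
def second_differences(xs):
    """Discrete acceleration of uniformly sampled positions."""
    return [xs[i + 1] - 2 * xs[i] + xs[i - 1] for i in range(1, len(xs) - 1)]


def looks_animate(track, tol=0.05):
    """Classify a track of (x, y) positions sampled at uniform time steps.

    Returns True when the acceleration along either axis varies by more
    than tol, which inertia and gravity alone would not produce.
    """
    for axis in range(2):
        acc = second_differences([p[axis] for p in track])
        if acc and max(acc) - min(acc) > tol:
            return True
    return False
```

A thrown ball gives a constant second difference on both axes and is labeled
inanimate, while a hand moving back and forth is labeled animate.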

Joint  Reference.  Using  its  new  2-DOF  hands  that  exploit  series  elastic 
actuators and rapid prototyping technology, Cog demonstrated basic
grasping and gestures. The gestural ability was combined with models from human
development  for  establishing  joint  reference,  that  is,  for  the  robot  to  attend  to 
the  same  object  that  an  instructor  is  attending  to.  Objects  that  are  within 
the  approximate  attention  range  of  the  human  instructor  are  made  more 
salient  to  the  robot.  Information  from  head  orientation  is  the  primary  cue 
of  attention  in  the  instructor. 
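
Making objects within the instructor's attention range more salient can be
sketched as a gain applied to objects that fall inside a cone around the
estimated head direction. The 2-D geometry, the cone half-angle, the gain and
the function name boost_by_gaze are all assumptions for this example and are
not taken from Cog's software.

```python
import math


def boost_by_gaze(objects, head_pos, head_dir, cone_deg=30.0, gain=2.0):
    """Raise the saliency of objects inside the instructor's gaze cone.

    objects: dict name -> ((x, y), base_saliency).
    head_pos, head_dir: instructor head position and facing direction (2-D).
    Returns dict name -> adjusted saliency.
    """
    hx, hy = head_dir
    hnorm = math.hypot(hx, hy)
    adjusted = {}
    for name, ((ox, oy), base) in objects.items():
        vx, vy = ox - head_pos[0], oy - head_pos[1]
        vnorm = math.hypot(vx, vy)
        if vnorm == 0.0 or hnorm == 0.0:
            adjusted[name] = base      # direction undefined: leave unchanged
            continue
        cos_angle = (vx * hx + vy * hy) / (vnorm * hnorm)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        adjusted[name] = base * gain if angle <= cone_deg else base
    return adjusted
```

With the instructor facing a cup, the cup can win the competition for the
robot's attention even when another object has a higher base saliency.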

Simulated Musculature. Cog’s arm and body are controlled via
simulated muscle-like elements that span multiple joints and operate
independently. Muscle strength and fatigue over time are modulated by a biochemical
model. The muscle-like elements are inspired by real physiology and allow
Cog to move with dynamics that are more human-like than conventional
manipulator control.

Vocabulary Management. Kismet needs to acquire a vocabulary
relevant to a human’s purpose. Towards this goal, first, we have implemented a
command protocol for introducing vocabulary to Kismet. Second, we have
developed an unsupervised mechanism for extracting candidate vocabulary
items from natural continuous speech. Third, we have analyzed the speech
used in teaching Kismet words in order to determine whether humans
naturally modify their speech in ways that would enable better word learning by
the robot.

Head Pose Estimation. We developed a fully automatic system for
recovering the rigid components of head pose. The conventional approach of
tracking pose changes relative to a reference configuration can give high
accuracy but is subject to drift. In face-to-face interaction with a robot, there
are likely to be frequent presentations of the head in a close to frontal
orientation, so we used that to make opportunistic corrections. Tracking of pose
was done in an intermediate mixed coordinate system chosen to minimize the
impact of errors in estimates of the 3D shape of the head being tracked. This
is vital for practical application to unknown users in cluttered conditions.

Papers on this work included: [2], [36], [32], [33], [34], [41], [53], [51], [78],
[89], [88], [87], [96].

Presentations on this work included:

[3], [44], [52], [75], [79], [90], [91], [94].

7  Results  for  2001-2002 

Cog 

Adaptation of Arm Stiffness. Cog learns a feed-forward command force
function that is dependent on arm posture but independent of stiffness. This
adaptation of stiffness parallels human reaching in which there is higher
stiffness at the endpoints and lower stiffness during the middle of a reach. It
allows Cog to reach to points in the arm’s workspace with greater accuracy,
gives Cog a more human-like range of dynamics and allows for safer and more
intuitive physical interaction with humans.

Reflex Inhibition. Inhibition of extreme movements prevents robotic failure.
Cog uses learned reflex inhibition for coordinated joint movement and
distribution  of  a  movement  over  as  many  degrees  of  freedom  as  possible, 
avoiding  saturation  of  a  few  joints.  During  learning  Cog  explores  the  gross 
limits  of  its  torso  workspace  by  the  action  of  reflexive  movements.  As  it 
reaches  joint  extremities,  a  simulated  pain  model  results  in  modification  of 
a  reflex  to  constrain  its  movements  to  avoid  physically  harming  itself  and  to 
operate  the  torso  primarily  in  a  state  of  balance. 

Dynamic Configuration of Multi-joint Muscles. To facilitate
development of a multi-joint muscle model for controlling Cog, a graphical user
interface (GUI) displays the movement of Cog in terms of Cog’s muscle model
overlaying Cog’s joints. The muscle model is reconfigurable at run time
through the GUI.

Hand Reflex. Cog’s two-degree-of-freedom hand, equipped with tactile
sensors, has a reflex that grasps and extends in a manner similar to primate
infants. Contact inside the hand causes a short term grasp; contact to the
back of the hand causes an extensive stretch.

Arm  Localization.  It  is  difficult  to  visually  distinguish  the  motion  of  a 
robot’s  own  arm  as  distinct  from  similar  motion  by  humans  or  objects.  Cog 
discovers  and  learns  about  its  own  arm  by  generating  a  motion  and  then 
correlating  the  associated  optic  flow  with  proprioceptive  feedback.  It  ignores 
any  uncorrelated  movements  and  visual  data.  Once  Cog  can  track  its  own 
arm,  when  it  contacts  an  object,  it  discounts  its  own  movement  in  order  to 
isolate  object  properties. 
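
The idea of correlating self-generated motion with optic flow can be sketched
by comparing a commanded joint speed signal with the flow magnitude seen in
each image region, and keeping the regions that rise and fall with the
command. The region names, the correlation threshold and the function names
are illustrative assumptions; Cog's actual procedure may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def regions_moving_with_arm(joint_speed, flow_by_region, threshold=0.8):
    """Return the image regions whose optic flow tracks the commanded motion."""
    return [name for name, flow in flow_by_region.items()
            if pearson(joint_speed, flow) >= threshold]
```

Regions that move independently of the command, such as a person walking by,
correlate poorly and are ignored.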

Object Tapping for Segmentation. There are cases when purely
vision-based object segmentation performs poorly or fails completely to
disambiguate an object from its background. Cog can determine the shape of
simple objects by tapping them. This physical experimentation augments
vision-based segmentation.

Mirror-Neuron Model. Cog is able to perform manipulative actions:
poking an object away from its body and poking an object towards itself.
It uses its attentional system to locate and fixate an object and its
tracking system to follow the object trajectory. It maps visual perception into a
sequence of motor commands to engage the object. These abilities,
vision-driven manipulation and mapping perception to action, are prerequisites of a
mirror-neuron-like representation.


On  the  left,  the  robot  establishes  a  causal  connection  between  commanded 
motion  and  its  own  manipulator,  and  then  probes  its  manipulator’s  effect  on 
an  object.  The  object  then  serves  as  a  literal  “point  of  contact”  to  link  robot 
manipulation  with  human  manipulation  (on  the  right),  as  is  required  for  a 
mirror-neuron-like representation.

Module  Integration.  Cog  has  a  modular  architecture  with  components 
responsible  for  sensing,  acting  and  processing  higher  level  aspects  of  vision 
and manipulation. Cog integrates modules responsible for 14 degrees of
freedom (head, torso and arm axes) in order to reach out and poke an object.
It  coordinates  its  head  control  and  arm  control  with  its  visual  attention, 
tracking,  and  arm  localization  subsystems. 

Face  Tracking.  Cog’s  attentional  system  is  updated  with  an  imported 
face  detector  that  has  greater  accuracy.  The  detector  is  coupled  with  a  face 
tracker  that  copes  with  non-frontal  face  presentations  despite  the  detector 
operating slower than frame rate. The combined systems allow Cog to engage
in  tasks  requiring  shared  attention  and  human-robot  interaction. 

M4 

7.1  Macaco. 

The  M4  robot  consists  of  an  active  vision  robotic  head  integrated  with  a 
Magellan  mobile  platform.  The  robot  integrates  vision-based  navigation  with 
human-robot  interaction.  It  operates  a  portable  version  of  the  attentional 
systems  of  Cog  and  Lazio  with  specific  customization  for  a  thermal  camera. 
Navigation,  social  preferences  and  protection  of  self  are  fulfilled  with  a  model 
of  motivational  drives.  Multi-tasking  behaviors  such  as  night  time  object 
detection,  thermal-based  navigation,  heat  detection,  obstacle  detection  and 
object  reconstruction  are  based  upon  a  competition  model. 

Kismet 

Dynamic Subjective Response. Kismet has the ability to learn to
recognize and remember people it interacts with. Such social competence leads
to  complex  social  behavior,  such  as  cooperation,  dislike  or  loyalty.  Kismet 
has  an  online  and  unsupervised  face  recognition  system,  where  the  robot 
opportunistically  collects,  labels,  and  learns  various  faces  while  interacting 
with  people,  starting  from  an  empty  database. 
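
Learning faces online from an empty database can be illustrated with a simple
incremental clustering scheme: each observation either joins the nearest known
person or starts a new one. This is a sketch only. The report does not say how
Kismet represents or compares faces, so the feature vectors, the distance
threshold and the class name FaceDatabase are hypothetical.

```python
import math


class FaceDatabase:
    """Toy online, unsupervised face store that starts empty."""

    def __init__(self, new_person_distance=1.0):
        self.new_person_distance = new_person_distance  # assumed threshold
        self.people = []   # one [centroid, count] entry per discovered person

    def observe(self, feature):
        """Assign a feature vector to a person index, creating one if needed."""
        best, best_d = None, float("inf")
        for idx, (centroid, _) in enumerate(self.people):
            d = math.dist(centroid, feature)
            if d < best_d:
                best, best_d = idx, d
        if best is None or best_d > self.new_person_distance:
            self.people.append([list(feature), 1])   # unseen face: new label
            return len(self.people) - 1
        centroid, count = self.people[best]
        count += 1
        # Running mean keeps the stored appearance up to date.
        updated = [c + (f - c) / count for c, f in zip(centroid, feature)]
        self.people[best] = [updated, count]
        return best
```

The returned index plays the role of an automatically assigned label that
later behavior could associate with how a given person treats the robot.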

Proto-linguistic Capabilities. Kismet uses utterances as a way to
manipulate its environment through the beliefs and actions of others. It has a
vocal behavior system forming a pragmatic basis for higher level language
acquisition. Protoverbal behaviors are influenced by the robot’s current
perceptual, behavioral and emotional states. Novel words (or concepts) are created
and managed. The vocal label for a concept is acquired and updated.

Papers on this work included:

[28],  [68],  [55],  [81],  [83],  [97],  [95], 

Presentations  on  this  work  included: 

[6],  [45],  [54],  [80], 

8 Results for 2002-2003

Guided Training via a Modular Software System for Learning from
Interaction with the Environment and People. Cog learns simple arm
and end effector tasks via a combination of self-exploration and explicit
training. With tactile reinforcement signals, Cog is taught by a human trainer
to perform simple postural arm and hand actions. Subsequently, the trainer
teaches the robot to perform such learned actions in response to tactile (touch
to particular fingers) and visual (objects of particular colors) stimuli.

Exploiting a Model of Muscle Fatigue for Human-like Movement.

Cog  has  a  fatigue  model  for  its  virtual  musculature.  This  simulation  of 
biological  muscle  fatigue  provided  signals  that  modulated  motor  performance 
and  provided  negative  reinforcement  to  the  learning  module  to  guide  the 
acquisition  of  more  natural  human-like  motor  movement. 

Learning  How  Joints  Move  in  Relation  to  Virtual  Muscle  Groups. 

Starting  simply,  from  an  inclination  to  randomly  move  its  virtual  muscles, 
Cog  learns  to  activate  its  muscle  model  so  it  can  move  to  particular  points  in 
joint angle space. Cog acquires an unsupervised linear dependency model
between joint velocities and controller modules that supervise multiple muscles
in  combination. 

Active segmentation. Cog uses active exploration to resolve visual
ambiguity in its workspace. Objects can sometimes be difficult to locate if their
visual appearance is similar to the general background. Cog solves this
problem by sweeping its arm through regions of interest. If no object is there, the
arm  passes  unimpeded.  If  an  object  is  present,  the  impact  between  it  and 
the  robot’s  arm  causes  the  object  to  move,  revealing  its  boundary. 

Cog  uses  a  mirror  neuron  model  to  learn  how  different  objects 
respond  to  the  actions  it  can  perform.  If  the  robot  taps  an  object  and 
it slips and rolls, it learns to predict the direction of slip based on visual
evidence,  and  can  then  use  that  information  to  deliberately  trigger  or  avoid 
rolling  an  object  while  tapping  it.  The  mirror  neuron  model  allows  the  robot 
to  mimic  an  action  demonstrated  by  a  human  relative  to  the  natural  behavior 
of  the  object,  rather  than  pure  geometry. 

Open object recognition. With open object recognition, the set of
objects Cog can recognize grows over time, as it accumulates experience through
active  segmentation  and  other  experimental  methods.  The  robot  clusters 
episodes  of  object  interaction  to  learn  the  properties  of  novel,  unfamiliar 
objects.  An  operator  can  introduce  names  for  objects  to  facilitate  further 
task-related  communication. 

Perceptual  cycle.  Cog  uses  the  constraints  of  known  activities  to  learn 
about the objects used within those activities - for example, during
manipulation. Cog can track known objects to learn about activities they occur
in,  such  as  a  sorting  task  or  object  search.  By  combining  the  ability  to  learn 
about  objects  through  activity  constraints  and  activities  through  tracking 
objects,  the  robot  can  achieve  a  virtuous  cycle  of  perception. 

Adaptive control of Cog’s arm using a nonlinear sliding-modes
controller. Two degrees of freedom on Cog’s arm operate via non-parametric
adaptive control using a nonlinear sliding-modes controller. This sufficiently
mitigates the poor signal-to-noise ratio arising in Cog’s arm (due to a small
strain gauge signal that experiences capacitive coupling with other signals)
and allows semi-autonomous, task-adequate control.
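
For readers unfamiliar with sliding-mode control, the sketch below shows a
textbook, non-adaptive version with a boundary layer, applied to a unit-mass
double integrator with a bounded disturbance. It is not the controller used on
Cog: the plant model, the gains lam, k and phi, and the disturbance are
assumptions chosen so the example is self-contained.

```python
import math


def sat(x):
    """Saturation function used in place of sign() to limit chattering."""
    return max(-1.0, min(1.0, x))


def sliding_mode_control(e, e_dot, lam=2.0, k=2.0, phi=0.1):
    """Control for the plant x'' = u + d, driving the error e toward zero.

    The sliding surface is s = e_dot + lam * e. The switching gain k must
    exceed the disturbance bound; phi is the boundary layer thickness.
    """
    s = e_dot + lam * e
    return -lam * e_dot - k * sat(s / phi)


def simulate(target=1.0, dt=0.001, steps=10000):
    """Track a constant target despite a sinusoidal disturbance (|d| <= 0.5)."""
    x, v = 0.0, 0.0
    for i in range(steps):
        d = 0.5 * math.sin(5.0 * i * dt)
        u = sliding_mode_control(x - target, v)
        v += (u + d) * dt
        x += v * dt
    return x
```

Because the switching term dominates any disturbance smaller than k, the
tracking error stays small without a model of the disturbance, which is the
property that makes this family of controllers attractive for noisy signals.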

Learning actions and objects from observed use. While Cog watches
an event involving someone’s arm handling an object (e.g. filing a surface,
swinging a pendulum), its vision system both extracts the nature of the arm
movement and derives a predictive dynamical model of the object.

A compact linear series elastic actuator design for human-like neck
joint. For a new robotic head, two new coupled neck axes were designed
and built using linear series elastic actuators aligned in parallel. The design
is compact: the two axes have intersecting centers of rotation. Force control
in combination with elastic actuation provides safe, human-like compliance.

The ALIVE architecture. The ALIVE architecture, consisting of a stack
and the CreaL software development environment, controls the new robotic
head. The stack is a special purpose, extensible, real-time, small form-factor
hardware architecture of controller boards, sensor boards, network board,
and off-the-shelf processor. CreaL, which is retargetable, extracts efficient
computational power to allow many lightweight threads from the relatively
cheap off-the-shelf processor via efficient software scheduling, compilation
and language abstraction. The ALIVE architecture facilitates complete
designer control over startup and failure sequences, which is essential for
continuous, safe robot operation.

Papers  on  this  work  included:  [8],  [9],  [11],  [12],  [10],  [7],  [57],  [58],  [69], 
[60],  [62],  [63],  [67],  [70],  [72],  [76],  [77],  [82], 

Presentations  on  this  work  included: 

[5],  [56],  [61],  [64],  [59]. 

9  Results  for  2003-2004 

Accomplishments  are  on  7  robotic  platforms:  Cog,  Cardea,  Coco,  a  robot 
head  named  Mertz,  a  new  humanoid  named  Domo,  a  human  wearable/hybrid 
system  named  Duo  and  an  unnamed  5  DOF  hand.  This  work  was  done 
between  July  2003  and  July  2004. 

The ’Yet Another Robot Platform’ (YARP) open software library is used on
6 platforms. The software is written in C and C++, provides routines for
robot platform development in terms of inter-process communication, vision
and control, and has operating system services support for Windows NT,
QNX4 and QNX6. It is running on multiple robotic platforms at MIT (Cog,
Coco, Domo, Mertz, Cardea and Duo) and in Europe.

Door Shoving by A Self-Balancing Mobile Humanoid. Cardea, a
prototype humanoid based on a Segway RMP extended with contact and
IR sensing, simple torso and 3 DOF arm manipulation, vision using 1 fixed
mounted camera and a single DOF camera and entirely on-board
computation, can navigate an office corridor, find a partially closed door, shove it
open and pass through.

Emergency Kickstands within Safety System for Segway RMP base.
Two emergency kickstands for Cardea deploy when a ’sniffer’ detects
software definable error conditions indicating the platform is falling over. They
are part of a complete safety system that overrides robotic control when
the RMP over-tilts. First, the system relies on RMP self-balancing. When
self-balancing fails, the kickstands eject. Safety is also ensured via radio
controlled Emergency stop (E-stop).
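
The layered behavior described above can be summarized as a small decision
function. This is a sketch of the ordering of the safeguards only: the
function name, the returned action labels and the 15 degree tilt limit are
hypothetical and do not come from Cardea's safety system.

```python
def safety_action(tilt_deg, balancing_ok, estop_pressed, tilt_limit_deg=15.0):
    """Choose a safety response from the current platform state.

    tilt_deg: measured tilt of the base (degrees).
    balancing_ok: False when the base's self-balancing has failed.
    estop_pressed: True when the radio E-stop has been triggered.
    """
    if estop_pressed:
        return "stop"                 # operator override takes priority
    if not balancing_ok:
        return "deploy_kickstands"    # last resort when balancing fails
    if abs(tilt_deg) > tilt_limit_deg:
        return "override_control"     # take control away from robot behaviors
    return "normal"
```

Because the error conditions are software definable, a real implementation
would read the limit and any further conditions from configuration.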

A Lightweight Computational Hardware Architecture Supporting
Humanoid Mobility and Manipulation. A computational hardware
architecture consisting of a network of distributed, onboard lightweight 8-bit
computational elements that supports behavior, sensorimotor and RMP
controllers, power circuitry and debugging demonstrates humanoid navigation
and manipulation.

A Prototype Camera-Arm Platform Integrating a Visual System
and a Motor System Running on an Embedded Architecture. The
design of embedded brushless motor amplifiers, DSP motor controllers and
sensor conditioning is integrated with the ALIVE hardware and software
architecture. A 5 DOF force-controllable prototype arm, with series elastic
actuators, a differentially driven shoulder and a virtually centered elbow,
runs on the embedded architecture using virtual spring control and a ‘CreaL’
(creature language) behavioral controller. It can track in conjunction with
a 2 DOF active vision system running on a laptop. It can reach towards
and poke an object using visual and color information, estimating the
position of its hand via forward kinematics in visual coordinate space.
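
Two of the ideas named above, virtual spring control and hand-position
estimation by forward kinematics, can be illustrated with a simplified
sketch. The real arm has 5 DOF and maps into visual coordinates; the
example below assumes a planar serial chain, and the function names, gains
and link lengths are hypothetical.

```python
import math


def virtual_spring_torque(theta, theta_dot, theta_rest, stiffness, damping):
    """Joint torque of a simulated spring-damper pulling toward theta_rest."""
    return -stiffness * (theta - theta_rest) - damping * theta_dot


def planar_forward_kinematics(joint_angles, link_lengths):
    """Hand position (x, y) of a planar serial arm from its joint angles."""
    x = y = 0.0
    heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle                 # each joint rotates all later links
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y


# Two unit links, elbow bent 90 degrees: the hand ends up near (1, 1).
hand_x, hand_y = planar_forward_kinematics([0.0, math.pi / 2], [1.0, 1.0])
```

Moving `theta_rest` over time drags the joint along compliantly, which is
one common way a virtual spring is used to produce reaching motions.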

A Creature-based Approach to Robotic Existence. Mertz, an
active-vision humanoid head platform, fulfills an immediate goal of running
continuously for days without supervision at a variety of locations. Mertz
is designed with fault prevention strategies in mind. It can start up
instantly and perform joint calibration. It has circuitry to protect
against power cycles and abrupt shutdown. Its vision system is adaptable
to different lighting conditions and backgrounds.

Domo: A Force Sensing and Compliant Humanoid Platform. Completed
the design, fabrication and assembly of a new force sensing and compliant
humanoid platform, named Domo, for exploring general dextrous
manipulation, visual perception and learning. Domo incorporates force
sensors and compliance in most of its joints to act safely in an
unstructured environment. It consists of two 6 DOF arms, two 4 DOF hands,
a 7 DOF head, a 2 DOF neck, 58 proprioceptive sensors and 24 tactile
sensors. Twenty-four DOF use force controlled compliant actuators. Its
realtime sensorimotor system is managed by an embedded network of DSP
controllers. Its vision system (2 cameras, 3 DOF), which utilizes the YARP
software library, and its cognitive system run on a small, networked
cluster of PCs.

Overcoming Mechanical Modes of Failure. Domo achieves mechanical
robustness: geartrain failures are mitigated by using ball screws and
elastic spring elements; motor winding overheating is avoided by current
limits in its brushless DC motor amplifiers and by prevention of stall
currents; susceptibility to cable breakage and wire strain has been
reduced; and maintenance is made easier by the design of modular
subsystems.

Two Force Controlled Arms. Domo’s arm design focuses on force
control. An arm is passively or actively compliant and able to directly
sense and command torques at each joint. This design forgoes the
conventional emphasis on end effector stiffness and precision to, instead,
mimic human capabilities. It relies on advanced linear Series Elastic
Actuators.
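
As a rough illustration of why a series elastic actuator can sense and
command force directly, the sketch below applies Hooke's law to the spring
deflection and closes a proportional loop on the result. The function
names, units and gain are assumptions for the example and are not taken
from Domo's controllers.

```python
def sea_measured_force(spring_stiffness_n_per_m, motor_position_m, load_position_m):
    """Force transmitted through the series spring (Hooke's law).

    The spring sits between the motor output and the load, so its
    deflection is the difference between the two positions.
    """
    deflection_m = motor_position_m - load_position_m
    return spring_stiffness_n_per_m * deflection_m


def force_control_step(desired_force_n, measured_force_n, gain):
    """One proportional control step.

    Returns a motor velocity command: positive values compress the spring
    further (more force), negative values relax it.
    """
    return gain * (desired_force_n - measured_force_n)


# Hypothetical numbers: a 10 kN/m spring deflected by 2 mm carries about 20 N.
force_n = sea_measured_force(10_000.0, 0.012, 0.010)
command = force_control_step(25.0, force_n, gain=0.001)
```

Because force is read from a deflection measurement, the same spring that
provides passive compliance also serves as the force sensor.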

A Robust Multi-Layered Sensorimotor and Cognitive System. Domo
has been designed with four layers of sensorimotor and cognitive systems:
a physical layer for sensors, motors and interface electronics; a DSP layer
for real-time control; a sensorimotor abstraction layer for interfacing
between the DSP and cognitive layers; and a cognitive layer. The design
emphasizes robustness to common modes of failure, real-time control of
time-critical resources and expandable computational capability. It runs
on a combination of special purpose embedded hardware communicating
through a CAN bus (or FireWire, in the case of cameras) to a cluster of
Linux nodes.

Advanced Design of Elastic Force Sensing Actuators with Embedded
Amplifiers. A new version of the series elastic actuator (SEA) was
designed using a) linear ball screws for greater efficiency and shock
tolerance, and b) a cable drive transmission. The cable drive allows
actuator mass to be moved far from the end point, reducing energy
consumption and hence requiring lower wattage motors; it also allows
modular and standardized packaging, implying easier maintenance and reuse.
A novel force sensing compliant (FSC) actuator places the spring element
between the motor housing and the chassis ground, which allows continuous
rotation at the motor output. The FSC actuator is compact due to its use
of torsion springs. Embedded custom brushless motor amplifiers and sensory
signal amplifiers are incorporated; they reduce wiring run-length, which
simplifies cable routing and improves robustness.

A 5 DOF Sensor Rich Hand with Series Elastic Actuation. Design,
fabrication and assembly of a 5 DOF sensor rich hand with simple, scalable
force actuators. The hand has three fingers with 8 force sensing axes and
5 position sensors, each consisting of 2 coupled and decoupling links
driven by a compact, inexpensive rotary series elastic actuator, which
makes the hand mechanically compliant and force controllable. The last two
links of each finger are equipped with dense arrays of force sensing
resistors.

DUO: A Human/Wearable Hybrid for Learning About Common
Manipulable Objects. Duo consists of a glasses mounted digital camera
connected to a backpack holding a laptop which communicates wirelessly
with a computer cluster. It also has four orientation sensors, mounted on
the head, wrist, upper arm and torso. Duo passively and actively observes
the manipulation of objects in natural, unconstrained environments. It
measures the kinematic configuration of its wearer’s head, torso and
dominant arm while watching its wearer’s workspace through a head mounted
camera. It requests helpful actions from its wearer through speech via
headphones. It can segment common manipulable objects with high quality.

Using Cast Shadows for Visually-Guided Touching. The shadow
cast by a robot’s own body is used to help direct its arm towards, across,
and away from an unmodeled surface without damaging it. The shadow is
detected by a camera and used to derive a time-to-contact estimate which,
when combined with the 2D tracked location of the arm’s endpoint in the
camera image, is sufficient to allow 3D control relative to the surface.
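
One simple way to obtain a time-to-contact estimate of this kind is to
watch the image-plane gap between the arm's endpoint and its shadow shrink
over time. The sketch below assumes that formulation; the gap measurement,
the two-sample finite difference and the speed-scaling rule are
illustrative assumptions and not necessarily the method used on the robot.

```python
import math


def time_to_contact(previous_gap_px, current_gap_px, dt_s):
    """Seconds until the endpoint meets its shadow at the current closing rate.

    Returns math.inf when the gap is not shrinking (no contact expected).
    """
    closing_rate_px_s = (previous_gap_px - current_gap_px) / dt_s
    if closing_rate_px_s <= 0.0:
        return math.inf
    return current_gap_px / closing_rate_px_s


def approach_speed_scale(ttc_s, safe_ttc_s=1.0):
    """Scale factor in [0, 1] that slows the approach as contact nears."""
    return max(0.0, min(1.0, ttc_s / safe_ttc_s))


# Gap shrank from 40 px to 30 px in 0.1 s: roughly 0.3 s to contact.
ttc = time_to_contact(40.0, 30.0, 0.1)
scale = approach_speed_scale(ttc)
```

No metric depth is needed here: the ratio of gap to closing rate has units
of time even when both are measured in pixels.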

Exploiting Amodal Cues for Robot Perception. Rhythmically moving
objects, such as tools and toys, are detected, segmented and recognized by
the sounds they generate as they move. This method does not require
accurate sound localization but can complement that information. It is
selective and robust in the face of distracting motion and sounds. This
perceptual tool is required for a robot to learn to use tools and toys
through demonstration.

Object Segmentation by Demonstration. A human teaches Cog how
to segment objects from arbitrarily complex non-static images by waving the
object to introduce it. An algorithm detects the skin color of the human’s
arm and tracks its motion. Then the object’s compact cover is extracted
using the periodic trajectory information.
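
To make the notion of "periodic trajectory information" concrete, the
sketch below estimates the dominant period of a tracked one-dimensional
coordinate using autocorrelation. This is only one plausible way to test a
trajectory for periodicity; the function name and the autocorrelation
approach are assumptions, not the algorithm used on Cog.

```python
import math


def dominant_period(samples):
    """Dominant period of a trajectory, in samples, or None if none is found.

    Picks the lag with the highest autocorrelation after the correlation
    has first dropped to zero or below, which skips the trivial peak at
    small lags.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    energy = sum(c * c for c in centered)
    if energy == 0.0:
        return None  # a constant signal has no period

    def autocorr(lag):
        return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy

    lag = 1
    while lag < n // 2 and autocorr(lag) > 0.0:
        lag += 1
    if lag >= n // 2:
        return None  # correlation never dropped: no oscillation observed
    best = max(range(lag, n // 2), key=autocorr)
    return best if autocorr(best) > 0.0 else None


# A waving motion sampled 20 times per cycle for 10 cycles.
waving_x = [math.sin(2.0 * math.pi * i / 20.0) for i in range(200)]
period = dominant_period(waving_x)
```

Pixels whose motion shares the detected period with the waving arm are
natural candidates for belonging to the object being introduced.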

Figure/Ground Segmentation from Human Cues. In order to infer
large scale depth and build 3-dimensional maps, Cog exploits its human
helper’s arm as a reference measure while measuring the relative size of
objects in a monocular image. It is also able to perform figure/ground
segregation on typical heavy objects in a scene, such as furniture, and to
perform 3D object and scene reconstruction. This argues for solving a
visual problem not simply by controlling the perceptual system, but by
actively changing the environment through experimental manipulation.
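
The idea of using an arm of roughly known length as a reference measure
can be illustrated with the pinhole camera model. The sketch below assumes
that the arm lies roughly parallel to the image plane at about the same
depth as the object, and that the focal length in pixels is known; these
assumptions, the function names and the numbers are illustrative only.

```python
def depth_from_reference(focal_length_px, reference_length_m, reference_length_px):
    """Depth of the reference (and nearby objects) under a pinhole model.

    A segment of real length L at depth Z projects to l = f * L / Z pixels,
    so Z = f * L / l.
    """
    return focal_length_px * reference_length_m / reference_length_px


def object_size_from_reference(object_px, reference_length_m, reference_length_px):
    """Real size of an object seen at the same depth as the reference."""
    metres_per_pixel = reference_length_m / reference_length_px
    return object_px * metres_per_pixel


# Hypothetical numbers: a 0.6 m forearm spans 100 px with f = 500 px.
depth_m = depth_from_reference(500.0, 0.6, 100.0)           # about 3 m away
chair_width_m = object_size_from_reference(50.0, 0.6, 100.0)  # about 0.3 m
```

The size estimate does not need the focal length at all, which is part of
what makes a reference object in the scene so convenient.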

A Learning Framework for a Humanoid Robot Inspired by
Developmental Learning. For Cog to learn about its physical surroundings,
a human helps Cog to correlate its own senses, to control and integrate
situational cues from its surrounding world and to learn about
out-of-reach objects and the different representations in which they might
appear. The strategies for this learning are inspired by child development
theory, which defines a separation and individuation developmental phase.

On-line Parameter Tuning of Neural Oscillators. Cog employs neural
oscillators in its arm that are capable of adapting to the dynamics of the
arm’s controlled system. Once a time-domain analysis has been used to
intuitively tune the parameters of the neural oscillators, Cog plays a
rhythmic musical instrument such as a drum or tambourine.
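
The text does not specify the oscillator model. As an illustration, the
sketch below simulates a two-neuron Matsuoka oscillator, a model commonly
used for rhythmic arm control, with forward Euler integration. The choice
of model, all parameter values and the `feedback` hook standing in for
joint sensing are assumptions made for the example.

```python
def simulate_matsuoka(duration_s=5.0, dt=0.001, tau1=0.1, tau2=0.2,
                      beta=2.5, gamma=2.0, tonic=1.0, input_gain=0.0,
                      feedback=lambda t: 0.0):
    """Simulate two mutually inhibiting neurons with self-adaptation.

    Returns the output y1 - y2 at each time step. `feedback(t)` stands in
    for a sensed joint signal; with input_gain > 0 it can entrain the
    oscillator to the dynamics of the driven system.
    """
    x1, x2, v1, v2 = 0.1, 0.0, 0.0, 0.0  # asymmetric start breaks symmetry
    outputs = []
    for k in range(int(round(duration_s / dt))):
        g = feedback(k * dt)
        y1, y2 = max(x1, 0.0), max(x2, 0.0)  # rectified firing rates
        dx1 = (tonic - x1 - beta * v1 - gamma * y2 - input_gain * max(g, 0.0)) / tau1
        dx2 = (tonic - x2 - beta * v2 - gamma * y1 - input_gain * max(-g, 0.0)) / tau1
        dv1 = (y1 - v1) / tau2               # adaptation (fatigue) states
        dv2 = (y2 - v2) / tau2
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
        v1, v2 = v1 + dt * dv1, v2 + dt * dv2
        outputs.append(y1 - y2)
    return outputs


rhythm = simulate_matsuoka()
```

With these values the mutual inhibition `gamma` lies between
`1 + tau1 / tau2` and `1 + beta`, the range in which this model sustains
oscillation; tuning the time constants changes the natural frequency.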

Learning Task Sequences from Scratch. Task sequencing requires
recognizing an object, identifying it with some associated action, then
learning the sequence of events and objects that characterize the task.
For example, a saw must be recognized and moved back and forth on the
correct plane to complete the task of sawing. Cog can learn task sequences
from human-robot interaction cues. A human teaches the robot new objects
such as tools and toys and their functionality. The robot explores the
world and extends its knowledge of the objects’ properties. It acquires
recognition of multi-modal percepts by manipulating the tools and toys.

Papers on this work included: [13], [20], [15], [22], [21], [24], [25], [26],
[19], [16], [23], [14], [17], [18], [50], [49], [46], [27], [29], [65], [66],
[71], [73], [93].

Presentations on this work included: [92], [74].